ABSTRACT

We consider the problem of statically assigning many tasks to a (smaller) system of homogeneous 
processors, where a task’s structure is modeled as a branching process, and all tasks are assumed 
to have identical behavior. We show how the theory of majorization can be used to obtain a partial 
order among possible task assignments. Our results show that if the vector of numbers of tasks 
assigned to each processor under one mapping is majorized by that of another mapping, then the 
former mapping is better than the latter with respect to a large number of objective functions. In 
particular, we show how measurements of finishing time, resource utilization, and reliability are all 
captured by the theory. We also show how the theory may be applied to the problem of partitioning 
a pool of processors for distribution among parallelizable tasks. 
1 Introduction


Parallel processing has emerged as an important means of achieving high computational perfor- 
mance. As a consequence, much research interest has been sparked in the area of efficient use of 
parallel computers. The problem of assigning tasks among processors to minimize processing time
has already received considerable attention in the literature, e.g., [3, 4, 8, 9, 12, 18]. We consider 
the problem of statically assigning tasks to processors when the tasks have unknown random pro- 
cessing times and a certain type of stochastic structure. The structure we examine embodies the 
notion of one task spawning a set of others; we examine static assignments, under the assumption 
that all offspring of a task are executed on the same processor as the task. Static assignment is 
likely to be used when a task’s state is large, thereby making dynamic assignment very costly in 
terms of communication. 


This paper examines theoretical issues associated with comparing different static mappings of 
a set of complex stochastic tasks. In particular, we show how the theory of majorization can be
used to derive strong results concerning the comparison of different mappings. The strength of
our contribution lies in our providing a formal underpinning to the analysis of mapping complex
stochastic tasks and to the optimization of a rich class of objective functions. 

Previous work on load balancing or task assignment [4, 4, 7, 8, 9, 12, 18] in parallel systems
may be loosely divided into three categories. The first category, with deterministic structure, 
involves task structures and execution times which are known prior to assignment. In this case 
[14] includes a study of problem complexity under various constraints and heuristic algorithms for 
task scheduling. A second class of load balancing formulations, in which task execution times are
random, is characterized by queueing-theoretic considerations [4, 16, 18]. Much of this work pertains 
to steady-state expectations of task delays with state-dependent [4, 18] and state-independent [16] 
assignment policies. Our work is closest to the third category [7, 8, 9, 13] which also takes task 
execution times to be random but focuses on minimizing expected processing times for a fixed set 
of tasks. As discussed in [9], the assumption of random execution times and a given set of tasks is 
justified in applications such as Monte-Carlo simulations. 

Our approach to the problem differs from previous work [7, 8, 9, 13] in several ways. In 
this paper, we do not concern ourselves with the explicit optimization of task assignment, but 
rather, with the comparison between different assignments over a wide range of possible objective 
functions. Past approaches typically address the question: given K processors and m tasks with
random execution requirements, find the assignment of tasks to processors that minimizes the 
expected maximum workload (or makespan). In this paper, we address a related question: given 
two assignments, when can we say that one is “better” than the other, and for what class of 
objective functions can we make this assertion? Our results have a simple general form. We can 
describe a mapping of probabilistically homogeneous tasks to processors by a vector m whose
ith component is the number of tasks assigned to the ith processor. Let m and m' describe two
different mappings. Then if m can be bounded by m' using the notion of majorization [10] (written
$m \prec m'$), then for all objective functions f in a class C we may say that the assignment described
by m is better than the assignment described by m'. The class C is often quite general, and
includes many commonly used objective functions, e.g., the expected maximum workload. We note 
that inequalities or stochastic orderings can be more useful than merely searching for
optimal assignments, because such orderings may be derived in a variety of cases where it is too
expensive to search for an optimal assignment. Inequalities are also useful when constraints on the
assignment (e.g., heterogeneous memory capacity among processors) prohibit one from adopting
an otherwise obvious optimal policy. Stochastic orderings are also of independent interest
[15], and in some of the cases we consider the optimal strategy is apparent from the derived
ordering.

Our interest in obtaining stochastic orderings also stems from the observation that they are often 
the only results available for small numbers of random variables and a wide variety of distributions. 
Consider the fact that in [8, 9] the results are asymptotic in at least one variable, n or K. In fact,
in [9], the results are only asymptotically correct in both the number of tasks n and the number
of processors K. These approaches are based on the use of the central limit theorem [8] and large
deviation theory [9], which are among the few limit results available that hold for a variety of
distributions. In contrast, our approach is concerned with finite (and possibly small) n and K and
we make use of the theory of stochastic majorization [10]. Thus, while some of our results are not 
as strong (in terms of optimality) as those obtained from fundamental limit theorems, the accuracy 
of our results does not depend on the number of tasks or processors. 

We now discuss other specific differences between our work and past efforts. Our structural 
model of a single task is that of a branching process: a completing process spawns a random number 
of subprocesses. This type of behavior appears in diverse applications such as Branch-and-Bound
searching algorithms [2] where the branching structure is obvious, and dynamic regridding algo- 
rithms in numerical computations [1] where sections of coarse grid serve as “processes” which give 
rise to “subprocesses” associated with finer grids. Furthermore, our results permit the analysis of 
much more complex objective functions than have typically been studied for stochastic task models. 
Our model differs significantly from those in [8, 9, 12]. The tasks in [9] were taken to be individ- 
ual independent and identically distributed (i.i.d.) samples drawn from a common distribution, 
and synchronization behavior is that of periodic global synchronization. In [8] a complex task is 
comprised of a fixed number of tasks with random i.i.d. execution times. However, the analyses 
in both [9] and [8] are concerned with overheads (e.g. synchronization and communication costs) 
that our model does not include. In some ways the present work resembles earlier results obtained 
under the assumption that the workload assigned to a processor causes the processor to behave 
as a Markov chain [18]. Like this earlier work, our new results show how the quality of a static 
assignment persists across numerous stochastic transformations of the workload. The model we 
study in the present paper is a distinct improvement over that in [13], as the stochastic behavior
of a processor is now explicitly dependent on the volume of workload it contains.

Other related research has been directed at computing the expected completion time for a 
single complex task with a possibly random acyclic structure [6, 17]. Another related publication 
[11] studies the problem of scheduling sub-tasks of a single task, where the sub-tasks form a tree.
Lastly, an analytic study of load-balancing statistically homogeneous workload on a hypercube is
presented in [<], where the mean and variance of the difference between the load on a processor
and the average load are derived. While past research has been concerned exclusively with a single
task or a given set of tasks, we also consider the joint assignment of multiple classes of tasks, where
tasks in different classes have different probabilistic behaviors. 

Our work draws on results from the study of stochastic majorization. The fundamental theory
of majorization originates in the economic study of income distribution, a sort of “load” balancing.
We believe majorization finds a natural application in the area of mapping parallel workload, and 
that one of our contributions is to demonstrate uses of this powerful theory in parallel processing. 
In this respect our work is similar to that in [3, 10]. In [3] the focus is on a new stochastic ordering 
based on the class of symmetric, convex and L-subadditive functions with applications to routing 
and designing processor speeds. The load balancing emphasis in [3] is on scheduling structurally 
simple tasks from a queue. Majorization in steady-state queue lengths of open queueing networks is 
studied in [Iff], in which orderings are parameterized by queue utilizations. In contrast, we use the 
established orderings in [10] to obtain inequalities among all generations of complex tasks under 
different static mappings of the initial tasks. 

The rest of this paper is organized as follows. In the next section, we define basic notation
and present our workload model; also, we discuss the different stochastic orderings to be used 
throughout the paper. Section §3 contains the fundamental orderings for workloads. Section §4 
discusses various objective functions of interest in parallel systems and Section §5 applies the theory 
to the problem of partitioning a pool of processors among a set of parallelizable tasks. Section §6 
summarizes our work.


2 Preliminaries 

We now introduce our model of computation, important definitions and known results, and a 
rationale for using majorization to study the assignment problem.

2.1 Workload and System Model 

We model the workload produced by a single task as a branching process [15, pp. 116-117], as 
follows. The task begins with a single work unit (WU) of computation. The WU is executed; upon
its completion a random number of other WUs are created, and placed in the task’s work list.
The initial WU can thus be thought of as containing the “seeds” for a number of additional WUs,
possibly zero, each of which similarly contains the seeds for additional WUs, and so on. One of
the first generation WUs may then be executed, and its children (which are generation 2 WUs)
spawned and placed in the task’s work list. The number of children a WU spawns is assumed to be
random, chosen from a probability distribution known as the branching distribution. The process 
is repeated until the task’s work list is empty. The task workload is comprised of all computation 
related to all WUs ultimately descended from the initial task WU. 

We assume that the order of WU execution in no way affects the spawning of children: a WU in 
the work list is destined to spawn some j children, regardless of the length of time it spends in the 
list. This is easily understood if one views the WU generation as reflecting some intrinsic structural 
property of the problem, e.g., the branching of a search tree. Because of this independence, every 
WU belongs to some “generation” which is independent of execution order. The initial WU is in 
generation 0; all children spawned by a generation 1 WU are in generation 2, and so on. 

We assume that a given WU may be executed with the same constant cost on any one of K
homogeneous processors, and that every WU is executed on the same processor as is its parent. 
Therefore, we map all computation associated with a task when we map the task’s initial WU. 

Consider the evolution of an initial task WU. Let $N_q$ denote the number of WUs in its $q$-th
generation. The size of the $q$-th generation is given as

\[
N_q = \sum_{j=1}^{N_{q-1}} Z_{j,q}, \qquad (1)
\]

where $N_0 = 1$ and where $Z_{j,q}$ is the number of WUs generated by the $j$-th WU in the $(q-1)$-th
generation. We assume that $\{Z_{j,q},\ 1 \le q < \infty\}_{j=1}^{\infty}$ is a sequence of independent and identically
distributed (i.i.d.) random variables (r.v.’s). The following notation will be employed:


• $K$ - the number of processors.

• $n$ - the number of initial task WUs.

• $m$ - an integer assignment vector whose $i$-th component gives the number of WUs assigned
to the $i$-th processor.

• $N_q$ - the size of generation $q$, descended from a single initial WU (when the branching
distribution is understood). For any subset $A \subseteq \mathbb{N}$, $S_A$ is the sum of all sizes of generations
in $A$: $S_A = \sum_{i \in A} N_i$.

• $f^{(j)}$ - the $j$-th convolution of a probability mass function $f$. If $X$ is a random variable, we will
also use $X^{(j)}$ to denote a sum of $j$ independent instances of $X$.

• $W_q(m)$ - the random vector of generation $q$ WUs resulting from assignment vector $m$:

\[
W_q(m) = \left( N_q^{(m_1)}, \ldots, N_q^{(m_K)} \right).
\]

We denote the $i$-th component of $W_q(m)$ by $(W_q(m))_i$. The notation is extended to arbitrary
subsets $A \subseteq \mathbb{N}$ by

\[
W_A(m) = \sum_{q \in A} W_q(m).
\]
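The workload model can be made concrete with a short simulation. The following is a minimal sketch, not part of the original analysis: the function names and the particular branching distribution (`binary_split`, which spawns 0 or 2 children with equal probability) are assumptions chosen only for illustration. It draws the generation sizes of a single task and then one sample of the vector $W_q(m)$ by summing the generation-$q$ sizes of the tasks placed on each processor.

```python
import random

def generation_sizes(offspring, q_max, rng):
    """Sizes N_0, ..., N_{q_max} of one task's generations (N_0 = 1)."""
    sizes = [1]
    for _ in range(q_max):
        # Every WU of the previous generation spawns an independent number of children.
        sizes.append(sum(offspring(rng) for _ in range(sizes[-1])))
    return sizes

def workload_vector(m, q, offspring, rng):
    """One sample of W_q(m): component i sums N_q over the m[i] tasks on processor i."""
    return [
        sum(generation_sizes(offspring, q, rng)[q] for _ in range(m_i))
        for m_i in m
    ]

def binary_split(rng):
    """Hypothetical branching distribution: 0 or 2 children, each with probability 1/2."""
    return 2 * rng.randint(0, 1)

rng = random.Random(0)  # fixed seed so that the sample is reproducible
sample = workload_vector([3, 2, 1], 2, binary_split, rng)
```

Because all descendants of a task stay on the processor that received its initial WU, the only decision variable in the sketch is the assignment vector passed as `m`.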

The theory we develop permits us to compare different mappings under a variety of objective
functions $\phi : \mathbb{R}^K \to \mathbb{R}$. Our results focus on comparing values of $E[\phi(W_q(m))]$ by deriving
conditions for inequalities involving initial task assignments $m$. Most of these are of the following
form: given two assignments $m$ and $m'$ where $m \prec m'$ (see Definition 2.1), then
$E[\phi(W_A(m))] \le E[\phi(W_A(m'))]$ for all subsets $A \subseteq \mathbb{N}$, when the expectations exist.

Applicable functions $\phi$ include any symmetric convex function; the maximum operator, all
powers of the maximum, the sum operator, and the product operator are of particular interest.
Thus a single comparison between the assignment vectors $m$ and $m'$ can yield a wealth of
information about the comparative behaviors of complex stochastic tasks under the two mappings. 

Our results are applicable to two different types of processor synchronization. We study generational
synchronization (GS) where processors engage in a barrier synchronization between each
WU generation. A processor executes all WUs of a given generation, say q , then synchronizes at 
the barrier. It is not released until all processors have executed all their generation q WUs and 
reached the barrier. The process repeats for subsequent generations. This type of synchronization 
is appropriate when the computation for a generation q in one task may depend on results computed 
by a generation q— 1 WU in another task. We also study termination synchronization (TS), where 
a processor engages in a barrier synchronization only after the work lists of all its initial tasks are 
empty. This is appropriate when the tasks are independent of each other, and the synchronization 
serves only to aggregate the final results of their respective computations. 

Not surprisingly, the optimal way of assigning n tasks to K processors is usually to assign n/K
to each. In the face of the obvious one may well ask why we study partial orderings. Primarily, the 
theory proves the optimality with respect to a large number of objective functions, thereby lending 
theoretical support to intuition. Secondly, the theory works even in the presence of constraints 
that disallow the uniform assignment, and complicate one’s intuition concerning optimality. For 
example, memory constraints may exist that forbid one or more processors from being assigned more 
than n/K tasks. The theory identifies the optimal assignment under heterogeneous constraints.

We will also apply these concepts to the issue of partitioning a pool of processors among a 
set of complex parallelizable tasks. Here we take K to be the number of parallelizable tasks,
and use m to describe the number of processors assigned to each. Constraints on feasible m
are easily envisaged, as the assignment may need to consider “natural” partition sizes that arise 
from communication topology, or system usage at the time of the assignment. So again, while the 
optimal solution to the constraint-free version of the problem may be apparent, the theory provides 
a means of comparing feasible solutions. 

2.2 Stochastic Ordering and Majorization

We now introduce the majorization partial ordering $\prec$ using notation largely taken from [10].

Definition 2.1 (majorization) A vector $x$ is majorized by vector $y$, written as $x \prec y$, iff

\[
\sum_{i=1}^{k} x_{[i]} \le \sum_{i=1}^{k} y_{[i]}, \quad k = 1, \ldots, n-1,
\qquad \text{and} \qquad
\sum_{i=1}^{n} x_{[i]} = \sum_{i=1}^{n} y_{[i]},
\]

where the notation $x_{[i]}$ is taken to be the $i$-th largest element of $x$.

Definition 2.2 (Schur-convex function) A function $\phi : \mathbb{R}^n \to \mathbb{R}$ is said to be Schur-convex if
$x \prec y$ in $\mathbb{R}^n$ implies $\phi(x) \le \phi(y)$ in $\mathbb{R}$.

Examples of Schur-convex functions include $\phi(x) = \max_i x_i$ and $\phi(x) = \sum_i g(x_i)$ for every convex
$g : \mathbb{R} \to \mathbb{R}$.
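Definition 2.1 translates directly into a check on sorted partial sums. The sketch below is an illustration only; the function name `is_majorized_by` and the tolerance parameter are assumptions introduced here. It also evaluates the two Schur-convex examples (the maximum, and a sum of a convex function of each component) on a pair of comparable assignment vectors.

```python
def is_majorized_by(x, y, tol=1e-12):
    """Return True if x is majorized by y in the sense of Definition 2.1."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    xs = sorted(x, reverse=True)  # x_[1] >= x_[2] >= ...
    ys = sorted(y, reverse=True)
    partial_x = partial_y = 0.0
    # Partial sums of the k largest elements, k = 1, ..., n-1.
    for k in range(len(xs) - 1):
        partial_x += xs[k]
        partial_y += ys[k]
        if partial_x > partial_y + tol:
            return False
    # The totals must agree.
    return abs(sum(xs) - sum(ys)) <= tol

balanced, skewed = [2, 2, 2], [3, 2, 1]
comparable = is_majorized_by(balanced, skewed)

# Two Schur-convex functions evaluated on the comparable pair.
max_values = (max(balanced), max(skewed))
square_sums = (sum(v * v for v in balanced), sum(v * v for v in skewed))

# Majorization is only a partial order: these two vectors are incomparable.
incomparable = (is_majorized_by([4, 1, 1], [3, 3, 0]),
                is_majorized_by([3, 3, 0], [4, 1, 1]))
```

The last pair shows why the ordering is partial: neither vector's sorted partial sums dominate the other's at every index.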

Let $C_0$ be the class of increasing functions from $\mathbb{R}^n$ to $\mathbb{R}$. The well-known stochastic ordering
between random variables [15] is defined as follows. For random vectors $X$ and $Y$ with distribution
functions $F$ and $G$ respectively,

\[
X \le_{st} Y \quad \text{iff} \quad \int_{\mathbb{R}^n} \phi(x)\, dF(x) \le \int_{\mathbb{R}^n} \phi(x)\, dG(x) \quad \forall \phi \in C_0
\]


such that the integrals are well defined. Majorization over deterministic quantities is extended to 
random variables in like manner by using an appropriate class of functions: 

\[
C_1 = \{scx\} = \{ f : \mathbb{R}^n \to \mathbb{R} \mid f \text{ Schur-convex} \},
\]

\[
C_2 = \{cas\} = \{ f : \mathbb{R}^n \to \mathbb{R} \mid f \text{ convex and symmetric} \}.
\]

These define respectively the Schur-convex partial ordering, denoted by $\prec_{scx}$, and the convex symmetric
partial ordering, denoted by $\prec_{cas}$ (the notations $\prec_{E_1}$ and $\prec_{E_2}$ are used in [10]). Thus,
$X \prec_{scx} Y$ iff

\[
\int_{\mathbb{R}^n} \phi(x)\, dF(x) \le \int_{\mathbb{R}^n} \phi(x)\, dG(x) \quad \forall \phi \in C_1,
\]

and $X \prec_{cas} Y$ iff

\[
\int_{\mathbb{R}^n} \phi(x)\, dF(x) \le \int_{\mathbb{R}^n} \phi(x)\, dG(x) \quad \forall \phi \in C_2.
\]

Note that $C_2 \subset C_1$ and thus $\prec_{scx}$ is a stronger ordering than $\prec_{cas}$.

Stochastic orderings based on likelihood ratio play an especially important role in this paper. 
Consider non-negative integer valued r.v.’s $X$ and $Y$ with probability mass functions $f$ and $g$.

Definition 2.3 (likelihood ratio) $X$ is defined to be smaller than $Y$ in likelihood ratio, written
as $X \le_{lr} Y$, iff

\[
\frac{f(m)}{g(m)} \le \frac{f(n)}{g(n)}, \qquad 0 \le n \le m, \quad n, m \in \mathbb{N}.
\]

Another important property for a probability distribution is known as increasing likelihood ratio.

Definition 2.4 (ILR) The non-negative integer valued r.v. $X$ is said to have increasing likelihood
ratio (ILR) (and its probability mass function $f$ is said to be ILR) iff

\[
c_1 + X \le_{lr} c_2 + X \quad \text{whenever } 0 \le c_1 \le c_2.
\]

Next we define another class of probability mass functions, those which have increasing likelihood 
ratio under convolution. 

Definition 2.5 (ILRC) Let $f$ be a probability mass function defined on $\mathbb{N}$. $f$ is said to have
increasing likelihood ratio under convolution (ILRC) iff $f^{(i)} \le_{lr} f^{(j)}$ whenever $i \le j$.

ILR distributions are known to be closed under convolution, even when the number of times con- 
volution is applied is random (provided the distribution of this number is also ILR) [10]. 

Lemma 2.1 Let $f$ be an ILR probability mass function. Then

• $f$ is ILRC.

• For any fixed integer $k > 0$, $f^{(k)}$ is ILR.

• Let $N$ be an ILR positive integer-valued random variable. Then $f^{(N)}$ is ILR.
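The likelihood ratio ordering and the ILRC property can be checked numerically for a distribution with finite support. The sketch below is illustrative only: the helper names are assumptions, the pmf `f` (a Binomial(2, 1/2) branching distribution) is a hypothetical example, and the ordering is tested in the cross-multiplied form $f(m)\,g(n) \le f(n)\,g(m)$ for $n \le m$, which avoids division by zero outside the support.

```python
def convolve(f, g):
    """pmf of the sum of two independent variables with pmfs f and g."""
    h = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            h[i + j] += fi * gj
    return h

def conv_power(f, j):
    """f^(j): pmf of a sum of j independent draws from f (j = 0 is a point mass at 0)."""
    out = [1.0]
    for _ in range(j):
        out = convolve(out, f)
    return out

def lr_leq(f, g, tol=1e-12):
    """Check f <=_lr g using f(m) g(n) <= f(n) g(m) for all n <= m."""
    size = max(len(f), len(g))
    f = list(f) + [0.0] * (size - len(f))  # pad with zeros to a common support
    g = list(g) + [0.0] * (size - len(g))
    return all(
        f[m] * g[n] <= f[n] * g[m] + tol
        for n in range(size)
        for m in range(n, size)
    )

f = [0.25, 0.5, 0.25]  # hypothetical branching pmf: Binomial(2, 1/2)
powers = [conv_power(f, j) for j in range(1, 5)]
# ILRC: f^(i) <=_lr f^(j) whenever i <= j (checked here for i, j up to 4).
ilrc_holds = all(
    lr_leq(powers[i], powers[j])
    for i in range(len(powers))
    for j in range(i, len(powers))
)
```

A finite check of this kind cannot prove the ILRC property for all $i \le j$; it only fails to find a violation among the convolution powers examined.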

Using these facts it is straightforward to prove the following. 

Lemma 2.2 Let $f$ be an ILR probability mass function. Then

• If $f$ is the branching distribution for a task, then for all generations $q$, $N_q$ is ILR.

• For any subset $A \subseteq \mathbb{N}$, if $S_A = \sum_{i \in A} N_i$ has finite mean, then $S_A$ is ILRC.

Proof: The proof of the first claim is a simple induction on $q$ that uses closure of the ILR property
under random ILR mixtures; the proof of the second rewrites the sum as $N_q + c_i$, where $q$ is the least
element of $A$, and $c_i \le c_j$ almost surely whenever $i \le j$. The result follows from Definition 2.4 and
the fact that $N_q$ is ILR. ■

As we will see, the assumption of an ILR branching distribution often yields $\prec_{scx}$ orderings.
The ILR condition is true of the discrete Uniform, Poisson, Geometric and Binomial distributions, 
showing that our results apply when the branching assumes some well-known distributions. 

Next we show how these stochastic orderings may be used to develop stochastic majorizations 
between different static mappings. 


3 Branching and Stochastic Majorization 

In this section we establish conditions under which either $\prec_{cas}$ or $\prec_{scx}$ orderings can be established
between “workload” vectors under different mappings. The notion of workload will be seen to
be quite general. Throughout this section it is important to remember that the results relate to
intrinsic properties of branching behavior, and do not depend on assumptions about execution 
behavior, e.g., synchronization. 

Our results for the $\prec_{scx}$ ordering are based on the following theorem, which is an application of
Theorem 3.J.2 in [10]. The correspondence between our form and the original is pointed out in the 
Appendix. 

Theorem 3.1 Let $f$ be an ILRC probability mass function, let $m = (m_1, \ldots, m_K)$ be a vector of
nonnegative integers, and for each $j = 1, \ldots, K$ let $X^{(m_j)}$ be a r.v. with distribution $f^{(m_j)}$. Suppose
this set of r.v.s is independent, and let $\phi : \mathbb{R}^K \to \mathbb{R}$ be a Schur-convex function. Then

\[
\gamma(m) = E\left[ \phi\left( X^{(m_1)}, \ldots, X^{(m_K)} \right) \right]
\]

is a Schur-convex function of $m$.

Using Theorem 3.1 we obtain our basic $\prec_{scx}$ ordering results.

Theorem 3.2 Consider a set of $n$ tasks, with common ILR branching distribution $f$, and let $m$
and $m'$ be two mapping vectors such that $m \prec m'$. Then

• For all generations $q$, $W_q(m) \prec_{scx} W_q(m')$.

• For any subset $A \subseteq \mathbb{N}$ such that $S_A$ has finite mean, $W_A(m) \prec_{scx} W_A(m')$.

Proof. Lemma 2.2 shows that the distributions of $N_q$ and $S_A$ are each ILRC; the result follows
from the definitions of $W_q(m)$ and $W_A(m)$, and Theorem 3.1. ■


Observe that the statement of Theorem 3.1 applies more generally to the notion of a random 
“reward” associated with each initial WU. It states that if each initial WU earns a random ILRC 
reward and if the reward to a processor is the sum of the rewards earned by its (independent) 
WU’s, then a stochastic majorization on the rewards follows from a deterministic majorization of
the initial WUs.


Our $\prec_{scx}$ results seem to require the assumption of ILR or ILRC branching distributions.
However, by constraining our attention to symmetric convex functions we are able to obtain
orderings for completely general branching distributions. The details, which are numerous, are
developed in the Appendix. The counterpart to Theorem 3.2 is


Theorem 3.3 Consider a set of $n$ tasks, with common nonnegative branching distribution $f$, and
let $m$ and $m'$ be two mapping vectors such that $m \prec m'$. Then

• For all generations $q$, $W_q(m) \prec_{cas} W_q(m')$.

• For any subset $A \subseteq \mathbb{N}$ such that $S_A$ has finite mean, $W_A(m) \prec_{cas} W_A(m')$.


3.1 Heterogeneous Constraints

The uniform $K$-vector $(n/K, \ldots, n/K)$ is majorized by any other vector whose components
are nonnegative and sum to $n$. Applied to the assignment problem, this shows that the obvious way
to balance workload is indeed the best, even for complex stochastic tasks. Optimality is less clear,
however, if the obvious assignment is prohibited by constraints. For each processor $i$ let $C_i$ be an
upper bound on the number of WUs the processor may be given. Such constraints might arise, for
instance, if the processors have different memory capacities. The obvious mapping is prohibited if
any $C_i < n/K$. Majorization provides a way to identify the best assignment of complex stochastic
tasks even in the face of such constraints.
Consider any feasible vector $y = (y_1, \ldots, y_K)$, $y_i \le C_i$ for $i = 1, \ldots, K$. Suppose there exist $i$
and $j$ such that $y_j > y_i + 1$, and $y_i + 1 \le C_i$. Construct a new vector $x$ from $y$ by transferring one
unit from $y_j$ to $y_i$, i.e., $x_j = y_j - 1$, $x_i = y_i + 1$, $x_k = y_k$ for all $k \ne i, j$. It is shown in [10] (5.D) that
$x \prec y$. This observation gives a rule by which we can iteratively improve a feasible solution, until
no further improvement is possible. We say a vector $x$ resulting from this process is balanced.

Without loss of generality assume that $C_1 \le C_2 \le \cdots \le C_K$. It is apparent that $x$ is balanced
if and only if whenever $x_j > x_i + 1$, then $x_i = C_i$. A characterization of balanced vectors then
is that there is some index $j$ such that $x_i = C_i$ for $i = 1, \ldots, j$, and for all $l, m > j$ we have
$|x_l - x_m| \le 1$. Furthermore, if $x$ and $y$ are both balanced, then this index $j$ is the same for both
of them. It follows then that $x \prec y$ and $y \prec x$, which shows the essential uniqueness of balanced
vectors. Balanced vectors are thus optimal under heterogeneous constraints.

A simple $O(n)$ algorithm will construct a balanced assignment. Assume the processors are
ordered by increasing constraint value, and initially set $x_i = 0$, $i = 1, 2, \ldots, K$. We loop repeatedly
over indices 1 to $K$. Each pass through the loop we increment $x_i$ once, provided $x_i < C_i$. This
essentially assigns one unit to the processor. We repeat the loop until all $n$ units are assigned.
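The round-robin construction described above can be sketched as follows. This is an illustrative rendering under stated assumptions, not a reference implementation: the function names are introduced here, the result is returned in the caller's original processor order, and an infeasible instance (total capacity below $n$) raises an error, a case the text does not discuss. This direct version also revisits saturated processors on every pass, so no claim is made about its running time.

```python
def balanced_assignment(n, capacities):
    """Assign n units round-robin, never exceeding any processor's upper bound."""
    if n > sum(capacities):
        raise ValueError("constraints cannot accommodate n units")
    # Visit processors in order of increasing constraint value.
    order = sorted(range(len(capacities)), key=lambda i: capacities[i])
    x = [0] * len(capacities)
    remaining = n
    while remaining > 0:
        for i in order:
            if remaining == 0:
                break
            if x[i] < capacities[i]:  # skip processors that are already full
                x[i] += 1
                remaining -= 1
    return x

def is_balanced(x, capacities):
    """Balanced: whenever x[j] > x[i] + 1, processor i is at its bound."""
    size = len(x)
    return all(
        x[i] == capacities[i]
        for i in range(size)
        for j in range(size)
        if x[j] > x[i] + 1
    )

assignment = balanced_assignment(7, [1, 2, 10])
```

With bounds `[1, 2, 10]` and seven units, the two small processors fill up and the remainder goes to the third, which is the only feasible vector from which no improving transfer exists.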

The main results of this section show that stochastic branching preserves stochastic majorization
for additive reward systems. As we have seen, useful reward systems are derived from the
generation sizes. The section to follow illustrates how these results can be fruitfully applied to 
various objective functions. 


4 Objective Functions 

We will now establish that a number of interesting objective functions are either Schur-convex or 
convex symmetric functions of some notion of workload. These objective functions include finishing 
time under different synchronization schemes, the space-time product, and overall reliability. This 
diversity of application demonstrates the utility of the theory. 


4.1 Finishing Time 

One use of majorization is to show that whenever $m \prec m'$, the computation’s expected finishing
time under $m$ is better than that under $m'$. This can be established using different models of
execution. For example, one easily envisions a computation where the tasks must synchronize
globally after every generation, i.e., GS synchronization. This is typical of tasks associated with
numerical computations. If the WUs each have unit execution time, then $\max_k \{(W_q(m))_k\}$ time
is required under mapping $m$ to execute all generation $q$ WUs. $N_q$ can be viewed as a random
reward associated with an initial WU, thus Theorem 3.3 tells us that $W_q(m) \prec_{cas} W_q(m')$. The
max operator is convex and symmetric, whence $E[\max_k \{(W_q(m))_k\}] \le E[\max_k \{(W_q(m'))_k\}]$.
This same result holds true if the WU execution times are random and i.i.d. Since the expected time
between successive synchronizations is no larger under $m$ than under $m'$, it follows that the expected
overall finishing time is no larger.
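For a branching distribution with finite support, the expected time to execute generation 1 under GS synchronization with unit-cost WUs can be computed exactly rather than simulated. The sketch below is an illustration under assumptions: the function names and the pmf `f` (Binomial(2, 1/2)) are hypothetical choices, and only generation $q = 1$ is treated, where component $k$ of $W_1(m)$ has pmf $f^{(m_k)}$. It uses $E[\max] = \sum_{t \ge 0} P(\max > t)$ and the independence of the processors.

```python
from itertools import accumulate

def convolve(f, g):
    """pmf of the sum of two independent variables with pmfs f and g."""
    h = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            h[i + j] += fi * gj
    return h

def conv_power(f, j):
    """f^(j): pmf of a sum of j independent draws from f."""
    out = [1.0]
    for _ in range(j):
        out = convolve(out, f)
    return out

def expected_max_workload(m, f):
    """Exact E[max_k X_k] for independent X_k with pmf f^(m_k)."""
    pmfs = [conv_power(f, m_k) for m_k in m]
    top = max(len(p) for p in pmfs) - 1  # largest value any component can take
    cdfs = [list(accumulate(p)) for p in pmfs]
    expected = 0.0
    for t in range(top):
        prob_all_at_most_t = 1.0
        for cdf in cdfs:
            # Beyond a component's support its CDF stays at its final value.
            prob_all_at_most_t *= cdf[min(t, len(cdf) - 1)]
        expected += 1.0 - prob_all_at_most_t  # P(max > t)
    return expected

f = [0.25, 0.5, 0.25]  # hypothetical branching pmf
e_balanced = expected_max_workload([2, 2, 2], f)
e_middle = expected_max_workload([3, 2, 1], f)
e_skewed = expected_max_workload([6, 0, 0], f)
```

Since $(2,2,2) \prec (3,2,1) \prec (6,0,0)$, the three expectations should be nondecreasing in that order, which is what the inequality above asserts for the maximum operator.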

Similar results are obtained under TS synchronization, where processors synchronize only at
termination. The reward for an initial WU can be defined to be the total size of the branching
tree rooted in that WU. When the mean of the branching distribution is strictly less than one, then
$E[S_{\mathbb{N}}] < \infty$. In this case, whenever $m \prec m'$, the expected maximum processor reward under $m$
is no larger than under $m'$. Even when the branching distribution’s mean is greater than or equal
to one (but is finite) we can always assert that the time to execute all generations up through $q$ is
no greater under $m$ than it is under $m'$, by defining the reward for an initial WU to be the sum of
the sizes of generations through $q$. Any symmetric convex function of the processor rewards (such
as the maximum processor reward) yields a $\prec_{cas}$ ordering.

Another metric of interest is the variation in the time to synchronize. The sample variance,
defined below, is also symmetric and convex.

\[
\mathrm{SampleVar}(x) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},
\]

where $\bar{x} = (\sum_i x_i)/n$. Thus,

\[
\mathrm{SampleVar}(W_q(m)) \prec_{cas} \mathrm{SampleVar}(W_q(m'))
\]

for any generation $q$, and

\[
\mathrm{SampleVar}(W_A(m)) \prec_{cas} \mathrm{SampleVar}(W_A(m'))
\]

for any $A \subseteq \mathbb{N}$ such that $S_A$ has finite mean. When the branching distribution is ILR, a similar
result holds true for the sample standard deviation (square root of variance) of time between
synchronizations, because the standard deviation is Schur-convex ([10], p. 71).


4.2 Functions of Queue Length 

When a WU completes its execution it generates its children and places them on the processor’s work 
list. Following this, another WU is selected to be executed. There is thus a storage cost associated 
with executing complex tasks; more generally, we show here how stochastic majorization can be 
applied to objective functions based on measuring queue lengths at every time step. A simple 
example of this is the computation’s total space-time product, defined as follows. Let Q(t) be the 
vector enumerating the number of WUs enqueued at each processor at time t, and let T be the 
computation’s termination time. Then the total space-time product is $\sum_{t=0}^{T} \sum_{k=1}^{K} (Q(t))_k$. This
idea can be generalized: let $s(j)$ quantify the cost of holding $j$ WUs in queue for one unit of time.
Then the total space-time cost with respect to $s$ is $\sum_{t=0}^{T} \sum_{k=1}^{K} s((Q(t))_k)$. We will show that if $s$
is increasing convex with $s(0) = 0$, and if $m \prec m'$, then under TS synchronization the expected
space-time cost with respect to $s$ is no worse under $m$ than it is under $m'$. This result is also
demonstrated for GS synchronization when the branching distribution is ILR.

Under the model assumptions we have made, the probabilistic behavior of a processor’s queue is 
completely independent of the queueing discipline used. We will assume that the queueing discipline 
is Smallest-Generation-First (SGF): whenever a processor selects a WU for execution from its work
list, it chooses one with least generation index. For simplicity, we also assume that the execution 
of a WU takes unit time. 

The space-time function s(k) = k gives rise to the usual space-time product, but other space- 
time cost functions are also intuitive. For example, one might have to store WU states on disk 
whenever the queue length exceeds a threshold L ; furthermore, once L is exceeded the cost might 
be superlinear, owing to fragmentation costs. A candidate cost function would be 


\[
s(k) =
\begin{cases}
0 & \text{if } k \le L, \\
(k - L)^{1+\epsilon} & \text{if } k > L,
\end{cases}
\]

where $\epsilon > 0$. The general assumptions that a space-time cost function be convex, increasing, and
zero for empty queue lists seem to us quite natural. 

Our treatment of space-time costs under TS synchronization hinges on the following observation:
if processor $k$ has exactly $(W_q(m))_k$ WUs in generation $q$, then under the SGF queueing
discipline at some point in time the processor’s queue will have exactly $(W_q(m))_k$ WUs. In particular,
at the instant where the first WU of generation $q$ is about to be executed, the queue consists
entirely of generation $q$ WUs, and contains all of them. We will show that the contribution to the
expected space cost made by processor $k$ while processing generation $q$ WUs (under SGF scheduling)
is an increasing convex function of $(W_q(m))_k$, and use this fact to find a majorization on the
vector of expected contributions made by all processors while processing generation $q$ WUs. This,
in turn, will show that the total expected space-time cost under $m$ is no worse than under $m'$
when the expectations exist. This is a $\prec_{cas}$ result, applicable for any branching distribution.

Suppose $(W_q(m))_k = r$. The processing of the $i$-th WU in generation $q$ ($i = 1, \ldots, r$) produces 
a random number $X_{q,i}$ of WUs, which join the processor's queue. The queue length at the 
instant the $i$-th WU begins execution is $r - (i - 1) + \sum_{j=1}^{i-1} X_{q,j}$, as there were $r$ work 
units in queue at the point the first generation $q$ WU was executed, $i - 1$ of them have been executed, 
and each one produced a random number of generation $q + 1$ WUs. Therefore, the conditional expected 
space-time cost suffered during the processing of this WU is 

$$ \phi(i, r) = E\Big[ s\Big( r - (i - 1) + \sum_{j=1}^{i-1} X_{q,j} \Big) \Big]. \qquad (2) $$

$\phi$ is convex in $r$, because for any convex $\gamma$ and random variable $Z$, the expectation 
$E[\gamma(a + Z)]$ is convex in $a$ (assuming the expectation exists). The expected space-time cost of 
processing all $r$ members of generation $q$ on processor $k$ is 

$$ C_s(r) = \sum_{i=1}^{r} \phi(i, r). $$

Finally, we claim that $C_s(r)$ is a convex function of $r$. To demonstrate this it suffices to show that 
$C_s(r + 2) + C_s(r) \ge 2 C_s(r + 1)$ for all $r$. Since $\phi$ is convex in $r$ we have 
$\phi(j, r + 2) + \phi(j, r) \ge 2\phi(j, r + 1)$ for all $j = 1, \ldots, r$. This observation reduces the 
problem to a demonstration that 

$$ \phi(r + 2, r + 2) + \phi(r + 1, r + 2) \ge 2\phi(r + 1, r + 1). $$

The fact that $s$ is increasing establishes that both $\phi(r + 2, r + 2)$ and $\phi(r + 1, r + 2)$ dominate 
$\phi(r + 1, r + 1)$, thereby proving the convexity of $C_s(r)$. 
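
The convexity claim can be checked numerically on a small instance. The sketch below computes the per-WU expected cost and the per-generation cost exactly with rational arithmetic, reading the queue length seen by the $i$-th WU as $r - (i - 1)$ plus the offspring of the $i - 1$ WUs already executed. The offspring distribution, the quadratic cost function, and all function names are hypothetical choices made for illustration.

```python
from fractions import Fraction


def convolve(p, q):
    """pmf of the sum of two independent nonnegative integer variables."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for a, pa in enumerate(p):
        for b, qb in enumerate(q):
            out[a + b] += pa * qb
    return out


def conv_power(pmf, n):
    """pmf of the sum of n i.i.d. draws from pmf (n = 0 is a point mass at 0)."""
    out = [Fraction(1)]
    for _ in range(n):
        out = convolve(out, pmf)
    return out


def phi(i, r, pmf, s):
    """Expected cost while the i-th of r WUs runs: E[s(r - (i-1) + offspring so far)]."""
    offspring = conv_power(pmf, i - 1)
    return sum(p * s(r - (i - 1) + x) for x, p in enumerate(offspring))


def space_time_cost(r, pmf, s):
    """Expected cost of processing all r WUs of one generation on one processor."""
    return sum(phi(i, r, pmf, s) for i in range(1, r + 1))


# Hypothetical instance: 0, 1 or 2 offspring with probabilities 1/2, 1/4, 1/4; s(k) = k^2.
offspring_pmf = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
quadratic = lambda k: k * k
costs = [space_time_cost(r, offspring_pmf, quadratic) for r in range(8)]
second_differences = [costs[r + 2] - 2 * costs[r + 1] + costs[r] for r in range(6)]
print(all(d >= 0 for d in second_differences))
```

Nonnegative second differences on this range are consistent with the convexity argument; they do not replace it.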

The function $T_s(r_1, \ldots, r_K) = \sum_{k=1}^{K} C_s(r_k)$ is symmetric and convex on $\mathbb{N}^K$, because whenever 
$g$ is convex on $\mathbb{R}$ then $h(x) = \sum_{k=1}^{K} g(x_k)$ is convex on $\mathbb{R}^K$. Observe that $T_s(W_q(m))$ is the random 
space-time cost with respect to $s$ and generation $q$ resulting from assignment vector $m$. We have 
proven the following result. 

Proposition 4.1 Let $s$ be an increasing convex function with $s(0) = 0$ and suppose the space cost 
of holding $k$ WUs in one processor's queue for one time unit is $s(k)$. Define 

$$ T_s(W_q(m)) = \sum_{k=1}^{K} C_s\big((W_q(m))_k\big) $$

to measure the space-time cost suffered while executing generation $q$, under the assignment given 
by $m$. Then whenever $m \prec m'$, 

• $E[T_s(W_q(m))] \le E[T_s(W_q(m'))]$ for $q = 0, 1, \ldots$. 

• The expected total space-time cost using TS synchronization is no worse under $m$ than under 
$m'$: 

$$ E\Big[\sum_{q=0}^{\infty} T_s(W_q(m))\Big] \le E\Big[\sum_{q=0}^{\infty} T_s(W_q(m'))\Big] \quad \text{whenever the expectation exists.} $$
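
For intuition, the ordering can be illustrated in the simplest deterministic special case, where $s(k) = k$ and no WU produces offspring, so that a processor holding $r$ WUs pays $r + (r - 1) + \cdots + 1 = r(r + 1)/2$. The sketch below is an illustration under those assumptions only; the function names and example vectors are hypothetical.

```python
def is_majorized_by(x, y):
    """True when x is majorized by y: equal totals, and each top-j sum of x <= that of y."""
    if len(x) != len(y) or sum(x) != sum(y):
        return False
    running_x = running_y = 0
    for a, b in zip(sorted(x, reverse=True), sorted(y, reverse=True)):
        running_x += a
        running_y += b
        if running_x > running_y:
            return False
    return True


def linear_cost_no_offspring(m):
    """Sum over processors of r(r+1)/2, the cost when s(k) = k and nothing is spawned."""
    return sum(r * (r + 1) // 2 for r in m)


balanced, skewed, extreme = (2, 2, 2), (3, 2, 1), (6, 0, 0)
print(is_majorized_by(balanced, skewed), is_majorized_by(skewed, extreme))
print([linear_cost_no_offspring(m) for m in (balanced, skewed, extreme)])
```

In this special case the more balanced assignment is never more expensive, which is the behavior the proposition establishes in expectation for general branching distributions.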
The analysis of space-time costs under GS synchronization requires more work, and the assumption 
of an ILR branching distribution. Suppose that $(W_q(m))_k = r_k$, for $k = 1, \ldots, K$. The 
space-time cost to processor $k$ during the interval of time when generation $q$ WUs are executed has 
two components. We have already seen the first: $C_s(r_k)$, the cost accumulated over the period of 
length $r_k$ while generation $q$ WUs are executed. The second component is the space-time cost 
suffered waiting for the most heavily loaded processor to finish. If processor $k$ generates $x$ generation 
$q + 1$ WUs, then the space-time cost it suffers waiting at the barrier is $(\max_j\{r_j\} - r_k)\, s(x)$. Recalling 
the definition of $\phi$ (equation (2)) we may write the expected total space-time cost of processing 
generation $q$ WUs under GS synchronization (conditioned on $(W_q(m))_k = r_k$, for $k = 1, \ldots, K$) as 

$$ G(r_1, \ldots, r_K) = \sum_{k=1}^{K} \Big( \sum_{i=1}^{r_k} \phi(i, r_k) + \big(\max_j\{r_j\} - r_k\big)\, \phi(r_k + 1, r_k) \Big). $$


Observe that $\phi(r_k + 1, r_k)$ is $E[s(X^{(r_k)})]$, where $X$ is the branching random variable and 
$X^{(r)}$ is the sum of $r$ independent copies of $X$. $G$ is Schur-convex on $\mathbb{N}^K$, a fact we show 
using the following characterization of Schur-convex functions on $\mathbb{N}^K$ (3.A.2.b in [10]). 

A function $\alpha$ on $\mathbb{N}^K$ is Schur-convex if and only if $\alpha$ is symmetric and 

$$ \alpha(r_1, t - r_1, r_3, \ldots, r_K) \text{ is increasing in } r_1 \ge t/2 $$

for each fixed $t, r_3, \ldots, r_K$. 

Fix $r_3, \ldots, r_K$, and consider $r_1 \ge r_2$. If the difference $G(r_1 + 1, r_2 - 1, \ldots, r_K) - G(r_1, r_2, \ldots, r_K)$ is 
always nonnegative, then the condition above tells us that $G$ is Schur-convex. We need to examine 
two cases, $\max_j\{r_j\} = r_1$, and the alternative. Assuming the former, straightforward algebra shows 
that the difference is bounded from below by 

$$ \sum_{i=1}^{r_1} \big[\phi(i, r_1 + 1) - \phi(i, r_1)\big] - \sum_{i=1}^{r_2 - 1} \big[\phi(i, r_2) - \phi(i, r_2 - 1)\big] + \big(\phi(r_1 + 1, r_1 + 1) - \phi(r_2, r_2)\big) - (r_1 - r_2)\big(\phi(r_2 + 1, r_2) - \phi(r_2, r_2 - 1)\big). $$

Both of the two summations above are positive, because $\phi(i, r)$ increases in $r$. Since $\phi(i, r)$ is convex 
in $r$ and $r_1 \ge r_2$, it also follows that $\phi(i, r_1 + 1) - \phi(i, r_1) \ge \phi(i, r_2) - \phi(i, r_2 - 1)$ for every $i$. Thus 
the positive summation above dominates the negative summation, and the desired inequality will 
hold if 

$$ \big(\phi(r_1 + 1, r_1 + 1) - \phi(r_2, r_2)\big) - (r_1 - r_2)\big(\phi(r_2 + 1, r_2) - \phi(r_2, r_2 - 1)\big) \ge 0. $$

Since $\phi(r, r)$ is a convex function of $r$, we have 

$$ \phi(r_1 + 1, r_1 + 1) - \phi(r_2, r_2) = \sum_{i=1}^{r_1 + 1 - r_2} \big(\phi(r_2 + i, r_2 + i) - \phi(r_2 + i - 1, r_2 + i - 1)\big) $$

$$ \ge \sum_{i=1}^{r_1 + 1 - r_2} \big(\phi(r_2 + 1, r_2 + 1) - \phi(r_2, r_2)\big) $$

$$ = (r_1 + 1 - r_2)\big(\phi(r_2 + 1, r_2 + 1) - \phi(r_2, r_2)\big) \ge (r_1 - r_2)\big(\phi(r_2 + 1, r_2 + 1) - \phi(r_2, r_2)\big), $$

where the last step uses the fact that $\phi(r, r)$ is nondecreasing in $r$. From this inequality we see 
that the desired bound will hold if $\phi(r_2 + 1, r_2 + 1) - \phi(r_2, r_2) \ge \phi(r_2 + 1, r_2) - \phi(r_2, r_2 - 1)$. 
The convexity of $s$ implies that 

$$ \phi(r_2 + 1, r_2 + 1) - \phi(r_2, r_2) = E\big[s(1 + X^{(r_2)}) - s(1 + X^{(r_2 - 1)})\big] $$

$$ \ge E\big[s(X^{(r_2)}) - s(X^{(r_2 - 1)})\big] $$

$$ = \phi(r_2 + 1, r_2) - \phi(r_2, r_2 - 1), $$

as needed. 

The argument for the case when $r_1 \ne \max_j\{r_j\}$ is almost exactly the same, and so is omitted. 
The Schur-convexity of $G$ gives us a stochastic majorization for GS synchronization. 

Proposition 4.2 Let $s$ be an increasing convex function with $s(0) = 0$, suppose the space cost 
of holding $k$ WUs in one processor's queue for one time unit is $s(k)$, and suppose the branching 
distribution is ILR. Define 

$$ G(r_1, \ldots, r_K) = \sum_{k=1}^{K} \Big( \sum_{i=1}^{r_k} \phi(i, r_k) + \big(\max_j\{r_j\} - r_k\big)\, \phi(r_k + 1, r_k) \Big) $$

to measure the space-time cost with respect to $s$ of executing some generation $q$ under GS 
synchronization, where each processor $i$ has $r_i$ generation $q$ WUs. Then $G$ is Schur-convex on $\mathbb{N}^K$, so 
that whenever $m \prec m'$, 

• $E[G(W_q(m))] \le E[G(W_q(m'))]$ for $q = 0, 1, \ldots$. 

• The expected total space-time cost using GS synchronization is no worse under $m$ than under 
$m'$: 

$$ E\Big[\sum_{q=0}^{\infty} G(W_q(m))\Big] \le E\Big[\sum_{q=0}^{\infty} G(W_q(m'))\Big] \quad \text{whenever the expectation exists.} $$
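
The GS cost of one generation can be evaluated in closed form when the cost function is linear, $s(k) = k$, because then only the mean offspring count matters: the expected queue length seen by the $i$-th of $r$ WUs is $r - (i - 1) + (i - 1)\mu$. The sketch below uses that special case with a mean offspring count `mu` below 1 (a subcritical process, which this sketch assumes); the function names and the example vectors are hypothetical.

```python
from fractions import Fraction


def phi_linear(i, r, mu):
    """Expected queue length when the i-th of r WUs starts, for s(k) = k."""
    return r - (i - 1) + (i - 1) * mu


def gs_cost(r, mu):
    """Execution cost plus barrier-waiting cost, summed over processors."""
    peak = max(r)
    total = Fraction(0)
    for r_k in r:
        executing = sum(phi_linear(i, r_k, mu) for i in range(1, r_k + 1))
        # While idle at the barrier the processor holds its new offspring,
        # whose expected count is phi_linear(r_k + 1, r_k, mu) = r_k * mu.
        waiting = (peak - r_k) * phi_linear(r_k + 1, r_k, mu)
        total += executing + waiting
    return total


mu = Fraction(3, 4)  # hypothetical mean number of offspring per WU
print(gs_cost((2, 2, 2), mu), gs_cost((3, 2, 1), mu))
print(gs_cost((2, 1, 5), mu), gs_cost((3, 0, 5), mu))
```

In both pairs the first vector is majorized by the second and has the smaller cost, including the second pair, where the processor receiving the extra WU is not the most heavily loaded one.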


4.3 Reliability 

Yet another application of majorization is to the question of whether the hardware will successfully 
execute the entire computation. We suppose that the computation “fails” if any processor having 
a non-empty queue fails. Observe that this definition permits the computation to successfully 
complete even if a processor dies before the entire computation is finished, provided the failing 
processor is itself already finished. We will show that if the branching distribution is ILR and a 
processor's time-to-failure distribution has an increasing hazard rate function, then the probability of 
failure under $m$ is no greater than that under $m'$, whenever $m \prec m'$. Conversely, if the branching 
distribution is ILR and the processor failure distribution has a decreasing hazard rate function, then 
the reliability under $m'$ is better than that under $m$. The result is proven for TS synchronization. 

Suppose that processor $i$'s time to failure is the random variable $Z$ with a monotone hazard 
rate function $\lambda(u)$. It is well known that 

$$ \Pr\{Z > t\} = \exp\Big\{ -\int_0^t \lambda(u)\, du \Big\}. $$

If $\lambda(u)$ is nondecreasing in $u$, then $-\int_0^t \lambda(u)\, du$ is concave in $t$, which is to say that $\log \Pr\{Z > t\}$ 
is concave. Conversely, if $\lambda(u)$ is decreasing, then $\log \Pr\{Z > t\}$ is convex. 

It follows (3.L.1 in [10]) that when $\lambda(u)$ increases, the product 

$$ \mathcal{R}(t_1, \ldots, t_K) = \prod_{i=1}^{K} \Pr\{Z > t_i\} \qquad (3) $$

is Schur-concave, or equivalently, that $-\mathcal{R}(t_1, \ldots, t_K)$ is Schur-convex. When $\lambda(u)$ decreases then 
$\mathcal{R}(t_1, \ldots, t_K)$ is Schur-convex. 
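
A concrete family with a monotone hazard rate is the Weibull distribution, whose hazard rate increases when the shape parameter exceeds 1 and decreases when it is below 1. The sketch below evaluates the product of survival probabilities for a balanced and an unbalanced vector of processing times. The Weibull choice, the parameter values, and the function names are assumptions made for illustration.

```python
import math


def weibull_survival(t, shape, scale=1.0):
    """Pr{Z > t} for a Weibull time to failure: exp(-(t/scale)^shape)."""
    return math.exp(-((t / scale) ** shape))


def all_survive(times, shape, scale=1.0):
    """Probability that each independent processor outlives its own processing time."""
    return math.prod(weibull_survival(t, shape, scale) for t in times)


balanced, skewed = (2.0, 2.0), (3.0, 1.0)  # same total processing time
# Increasing hazard rate (shape 2): the balanced vector is more reliable.
print(all_survive(balanced, shape=2.0) > all_survive(skewed, shape=2.0))
# Decreasing hazard rate (shape 0.5): the ordering reverses.
print(all_survive(balanced, shape=0.5) < all_survive(skewed, shape=0.5))
```

With shape 2 the two products are $e^{-8}$ and $e^{-10}$; with shape 0.5 they are $e^{-2\sqrt{2}}$ and $e^{-(\sqrt{3}+1)}$, so both comparisons hold.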

If processor $i$ is assigned $m_i$ WUs initially, it ends up processing $S_H^{(m_i)}$ WUs total. This is also 
processor $i$'s processing time under the assumptions of SGF scheduling, TS synchronization, and 
unit execution cost per WU. Given $S_H^{(m_i)} = t_i$ for $i = 1, \ldots, K$, equation (3) gives the probability 
that every processor executes all WUs without processor failure. The unconditional probability is 
obtained by taking the expectation with respect to the joint distribution of $S_H(m)$: 

$$ \Pr\{\text{every processor executes all its WUs before failing}\} = E[\mathcal{R}(S_H(m))]. $$

Lemma 2.2 asserts that $S_H$ is ILRC if the branching distribution is ILR. It follows from Theorem 3.1 
that when $\lambda(u)$ is increasing, $E[\mathcal{R}(S_H(m))]$ is a Schur-concave function of $m$. This proves the 
following proposition. 

Proposition 4.3 Suppose the hazard rate function $\lambda(u)$ for the time to processor failure is 
increasing, and suppose the branching distribution is ILR. Let $\gamma(m)$ be the probability that every processor 
executes all its WUs without processor failure. Then under TS synchronization and SGF scheduling, 
whenever $m \prec m'$ we have $\gamma(m) \ge \gamma(m')$. The inequality is reversed if $\lambda(u)$ is decreasing. 

5 Assignment of Processor Pools 

Our last application of stochastic majorization concerns a problem where a large number $P$ of 
processors are to be partitioned among a smaller number $T$ of complex tasks. Parallel processing 
can be applied to the tasks to accelerate execution time. We assume that a task requires that all 
of its generation $i$ WUs be executed before any of its generation $i + 1$ WUs are, but that all 
generation $i$ WUs may be processed in parallel. As before, the overall system may use either TS 
or GS synchronization. 

Let $g(X, m)$ give the time required by $m$ processors to execute $X$ WUs. We assume that $g(X, m)$ 
is convex in $m$, e.g., $g(X, m) = X/m$, and that $g(0, m) = 0$. 

Suppose there are $K$ initial WUs. We may describe our assignment of processors to these 
WUs with vector $m$, whose $i$-th component gives the number of processors assigned to the $i$-th WU. 
Also let $N_{q,t}$ denote the random number of WUs associated with generation $q$ of task $t$. Under GS 
synchronization, the time required to complete the $q$-th generation is 

$$ \gamma_q(m) = \max\{ g(N_{q,1}, m_1),\ g(N_{q,2}, m_2),\ \ldots,\ g(N_{q,K}, m_K) \}. $$

Under our assumptions, $E[\gamma_q(m)]$ is a symmetric convex function of $m$ (Proposition B.d in [10]), 
showing that $E[\gamma_q(m)] \le E[\gamma_q(m')]$ whenever $m \prec m'$. It follows immediately that the overall 
expected finishing time under GS synchronization is no worse under $m$ than under $m'$. 

Under TS synchronization the finishing time is 

$$ \rho(m) = \max\Big\{ \sum_{q=0}^{\infty} g(N_{q,1}, m_1),\ \ldots,\ \sum_{q=0}^{\infty} g(N_{q,K}, m_K) \Big\}. $$

A sum of convex functions remains convex, whence $E[\rho(m)]$ is symmetric and convex in $m$. When 
$m \prec m'$, we are assured that the expected finishing time under TS synchronization is no worse 
using $m$ than it is with $m'$. 
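
The effect of balancing a processor pool can be seen on a small exact computation. The sketch below takes $g(X, m) = X/m$ and evaluates the expected time of a single generation as the expectation of the largest per-task time, with independent and identically distributed WU counts per task. The two-point distribution for the WU count and the function name are hypothetical.

```python
from fractions import Fraction
from itertools import product


def expected_generation_time(m, pmf):
    """E[max_t N_t / m_t] for i.i.d. WU counts N_t with the given pmf (a dict)."""
    total = Fraction(0)
    for counts in product(pmf, repeat=len(m)):
        prob = Fraction(1)
        for n in counts:
            prob *= pmf[n]
        total += prob * max(Fraction(n, m_t) for n, m_t in zip(counts, m))
    return total


# Hypothetical workload: each task has 1 or 3 WUs in this generation, equally likely.
wu_count_pmf = {1: Fraction(1, 2), 3: Fraction(1, 2)}
print(expected_generation_time((2, 2), wu_count_pmf))  # four processors split evenly
print(expected_generation_time((3, 1), wu_count_pmf))  # same four processors, split unevenly
```

The even split is majorized by the uneven one and gives the smaller expected time here, in line with the symmetric-convexity argument.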


6 Conclusions 

This paper explores the application of majorization to the problem of assigning a large number of 
stochastically complex (but probabilistically identical) tasks onto a multiprocessor. Using a model 
of workload based on branching processes, the theory we develop establishes a partial order 
among possible assignments of tasks to processors. We show that the quality of an initial assignment 
persists through stochastic transformations of the workload, and that the ordering can be taken 
with respect to a wide range of objective functions including those measuring finishing time, space 
usage, and reliability. We also show how the theory applies to the processor partitioning problem. 
The utility of the theory lies in the generality of the objective functions that can be considered, and 
in the fact that optimal solutions can be identified even when constraints are placed on potential 
assignments. 
A Appendix 


In this appendix we prove some claims made earlier in the paper. 

The ILRC condition upon which our results depend involves the notion of totally positive 
functions. Chapter 18 of [10] is the source for the following definition. 


Definition A.1 (Totally Positive Function) Let $A$ and $B$ be subsets of the real line. A function 
$\alpha : A \times B \to \mathbb{R}$ is said to be totally positive of order $k$, denoted $TP_k$, if for all $m$, $1 \le m \le k$, and 
all $x_1 < x_2 < \cdots < x_m$, $y_1 < y_2 < \cdots < y_m$ ($x_i \in A$, $y_j \in B$), 

$$ \det \begin{pmatrix} \alpha(x_1, y_1) & \cdots & \alpha(x_1, y_m) \\ \vdots & & \vdots \\ \alpha(x_m, y_1) & \cdots & \alpha(x_m, y_m) \end{pmatrix} \ge 0. $$

We will use the following result (18.A.4.a in [10]). 

Lemma A.1 If $K$ is $TP_m$ and $L$ is $TP_n$, and $\sigma$ is a $\sigma$-finite measure, then the convolution 

$$ M(x, y) = \int K(x, z) L(z, y)\, d\sigma(z) $$

is $TP_{\min\{m,n\}}$. 

The relationship between total positivity and ILRC distributions is direct. Given any 
integer-valued nonnegative probability mass function $f$ we may define the function $\alpha_f : \mathbb{N} \times \mathbb{N} \to [0, 1]$: 

$$ \alpha_f(i, x) = f^{(i)}(x). $$

$\alpha_f$ is $TP_2$ iff 

$$ f^{(i)}(m)\, f^{(j)}(n) \ge f^{(i)}(n)\, f^{(j)}(m) $$

for all $i < j$, $m < n$. But this is equivalent to saying that $f^{(i)} \le_{lr} f^{(j)}$, i.e., that $f$ is ILRC. 
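
The $TP_2$ condition on convolution powers can be tested directly for a given probability mass function. The sketch below builds the first few convolution powers with exact rational arithmetic and checks the product inequality for every pair of powers and every pair of points. It checks only powers 1 through `max_power`, so a `True` result is evidence rather than proof; the function names and the two example distributions are hypothetical.

```python
from fractions import Fraction


def convolve(p, q):
    """pmf of the sum of two independent nonnegative integer variables."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for a, pa in enumerate(p):
        for b, qb in enumerate(q):
            out[a + b] += pa * qb
    return out


def conv_powers(pmf, max_power):
    """List [f^(1), f^(2), ..., f^(max_power)] of convolution powers."""
    powers = [list(pmf)]
    for _ in range(max_power - 1):
        powers.append(convolve(powers[-1], pmf))
    return powers


def looks_ilrc(pmf, max_power):
    """Check f^(i)(m) f^(j)(n) >= f^(i)(n) f^(j)(m) for i < j <= max_power, m < n."""
    powers = conv_powers(pmf, max_power)
    size = len(powers[-1])
    value = lambda p, x: p[x] if x < len(p) else Fraction(0)
    for i in range(len(powers)):
        for j in range(i + 1, len(powers)):
            for m in range(size):
                for n in range(m + 1, size):
                    lhs = value(powers[i], m) * value(powers[j], n)
                    rhs = value(powers[i], n) * value(powers[j], m)
                    if lhs < rhs:
                        return False
    return True


fair_coin = [Fraction(1, 2), Fraction(1, 2)]                             # 0 or 1 offspring
gappy = [Fraction(1, 3), Fraction(1, 3), Fraction(0), Fraction(1, 3)]   # 0, 1 or 3 offspring
print(looks_ilrc(fair_coin, 4), looks_ilrc(gappy, 2))
```

The second distribution fails already between the first and second powers at the points 2 and 3, where the first power vanishes at 2 but not at 3.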

The reason for our interest in ILRC distributions $f$ is that their convolution functions $\alpha_f$ satisfy 
three criteria required by Theorem 3.J.2 of [10]: 

• $\alpha_f(x, y) = 0$ whenever $y < 0$; 

• $\alpha_f$ is totally positive of order 2; 

• $\alpha_f(x + z, y) = \int \alpha_f(x, u)\, \alpha_f(z, y - u)\, d\nu(u)$, for some measure $\nu$ on $\mathbb{N}$. 





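Theorem 3.J.2's conclusion is that if $m = (m_1, \ldots, m_K) \in \mathbb{N}^K$, $\nu$ is counting measure, and 
$\phi : \mathbb{N}^K \to \mathbb{R}$ is Schur-convex, then 

$$ \psi(m) = \int \prod_{i=1}^{K} \alpha_f(m_i, y_i)\, \phi(y) \prod_{i=1}^{K} d\nu(y_i) \qquad (4) $$

is Schur-convex on $\mathbb{N}^K$. Theorem 3.1 is a restatement of this result; because 
$\alpha_f(m_i, y_i)$ is a probability, we recognize that $\psi(m)$ expresses the expected value of $\phi(y)$. 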


$\prec_{cas}$ Results 

We next consider the $\prec_{cas}$ ordering. In this case, we are able to obtain the analogue of Theorem 3.1, 
save that the result holds for completely general branching distributions. We first 
introduce a little more terminology, and develop an intermediate result. 

A random vector $X = (X_1, \ldots, X_n)$ is said to have exchangeable components if the 
distribution of $X_1, \ldots, X_n$ is invariant under permutations of its components. Our basic results 
rest on the following observation. 

Lemma A.2 Let $X, Y$ be any nonnegative random variables and $Z = (Z_1, Z_2)$ be a random vector 
with nonnegative exchangeable components, independent of $X$ and $Y$, and 
define $U = (X, Y)$ and $V = (X + Y, 0)$. Then 

$$ Z + U \prec_{cas} Z + V. $$

Proof Let $\phi : \mathbb{R}^2_+ \to \mathbb{R}$ be a convex symmetric function. Define the function $\psi : \mathbb{R}^2_+ \to \mathbb{R}$ 
as $\psi(a) = E[\phi(Z + a)]$, $\forall a \in \mathbb{R}^2_+$. Since $Z$ has exchangeable components, $\psi$ is also a convex 
symmetric function. 

Now $U \prec V$ a.s., from which it follows 

$$ \psi(U) \le \psi(V) \;\Rightarrow\; E[\psi(U)] \le E[\psi(V)] $$

$$ \Rightarrow\; E[\phi(Z + U)] \le E[\phi(Z + V)] $$

$$ \Rightarrow\; Z + U \prec_{cas} Z + V. $$

■ 
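
The inequality at the heart of this lemma can be verified exactly on small discrete distributions. The sketch below enumerates all outcomes of $X$, $Y$, $Z_1$, $Z_2$ (taken independent, with $Z_1$ and $Z_2$ identically distributed so that they are exchangeable) and compares the expectation of a convex symmetric function when $X$ and $Y$ are placed on different components with the case where both are placed on the first. The distributions, test functions, and names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def expected_value(phi, placement, dist_x, dist_y, dist_z):
    """E[phi(Z + placement(X, Y))] with X, Y, Z1, Z2 independent and Z1, Z2 ~ dist_z."""
    total = Fraction(0)
    cases = product(dist_x.items(), dist_y.items(), dist_z.items(), dist_z.items())
    for (x, px), (y, py), (z1, p1), (z2, p2) in cases:
        u1, u2 = placement(x, y)
        total += px * py * p1 * p2 * phi(z1 + u1, z2 + u2)
    return total


def spread(x, y):
    """U = (X, Y): the two increments land on different components."""
    return (x, y)


def pooled(x, y):
    """V = (X + Y, 0): both increments land on the first component."""
    return (x + y, 0)


coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}
sum_of_squares = lambda a, b: a * a + b * b
largest = lambda a, b: max(a, b)

for phi in (sum_of_squares, largest):
    low = expected_value(phi, spread, coin, coin, coin)
    high = expected_value(phi, pooled, coin, coin, coin)
    print(low <= high)
```

For the sum of squares the gap between the two expectations equals $2E[XY] = 1/2$ in this instance, since the terms involving $Z_1$ and $Z_2$ cancel.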

The result extends easily to $\mathbb{R}^K_+$. 

Lemma A.3 Let $X, Y$ be any nonnegative random variables, let $Z = (Z_1, Z_2, \ldots, Z_K) \in \mathbb{R}^K_+$ be a 
random vector with $Z_1, Z_2$ exchangeable, let $X$, $Y$, and $Z$ be mutually independent, and define 
$U = (X, Y, 0, \ldots, 0)$ and $V = (X + Y, 0, \ldots, 0)$. Then 

$$ Z + U \prec_{cas} Z + V. $$

Proof Let $\phi : \mathbb{R}^K_+ \to \mathbb{R}$ be a symmetric convex function. Now, $\phi$ is symmetric and convex in the 
first two arguments. Therefore, we can condition the values of $Z_i$, $3 \le i \le K$, to be $z_i$ and apply 
the previous lemma to obtain 

$$ E[\phi(Z + U) \mid Z_i = z_i,\ 3 \le i \le K] \le E[\phi(Z + V) \mid Z_i = z_i,\ 3 \le i \le K]. $$

Removal of the conditioning on $Z_i$, $3 \le i \le K$, yields the desired result. ■ 

We are now prepared to prove Theorem 3.3. Let $m'$ be any mapping vector where there are 
processors $i, j$ such that $m'_i > m'_j$. Without loss of generality we may take $i = 1$ and $j = 2$, and 
let $m$ be the mapping vector obtained from $m'$ by moving one WU from processor 1 to processor 
2. We apply Lemma A.3. Interpret $Z_1, Z_2$ as $m'_2$-fold convolutions of initial WU rewards, $X$ 
as the convolution of $m'_1 - m'_2 - 1$ initial WU rewards, $Y$ as a single initial WU reward, and each 
$Z_j$ for $j > 2$ as the convolution of $m'_j$ initial WU workloads. The application of Lemma A.3 yields 
$R(m) \prec_{cas} R(m')$. 

The incremental movement of a task from a heavily loaded processor to a more lightly loaded 
processor corresponds to the more general notion of a "transfer" [10]. It is known that whenever 
$x \prec y$, $x$ can be constructed from $y$ with a finite number of transfers, where each transformed 
vector is always dominated under $\prec$ by its predecessor. Consequently if $m'$ is a mapping vector 
with $m \prec m'$, then one demonstrates that $W(m) \prec_{cas} W(m')$ through a repeated application of 
Lemma A.3 to the sequence of transfers that transmute $m'$ into $m$. This proves the result. 
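
One standard way to construct such a sequence of transfers for integer vectors is sketched below. Working on the vectors sorted in decreasing order, each step moves one unit from the last position where the current vector exceeds the target to the next later position where it falls short. This particular rule, the function names, and the example vectors are choices of this sketch, not a construction taken from the text; vectors are treated up to permutation of their components.

```python
def is_majorized_by(x, y):
    """True when x is majorized by y: equal totals, and each top-j sum of x <= that of y."""
    if len(x) != len(y) or sum(x) != sum(y):
        return False
    running_x = running_y = 0
    for a, b in zip(sorted(x, reverse=True), sorted(y, reverse=True)):
        running_x += a
        running_y += b
        if running_x > running_y:
            return False
    return True


def transfer_chain(m, m_prime):
    """Sorted vectors leading from m_prime down to m, one unit transfer per step."""
    target = sorted(m, reverse=True)
    current = sorted(m_prime, reverse=True)
    if not is_majorized_by(target, current):
        raise ValueError("m must be majorized by m_prime")
    chain = [tuple(current)]
    while current != target:
        # Last position still holding more than the target requires.
        donor = max(i for i in range(len(current)) if target[i] < current[i])
        # First later position still holding less than the target requires.
        receiver = min(i for i in range(donor + 1, len(current)) if target[i] > current[i])
        current[donor] -= 1
        current[receiver] += 1
        chain.append(tuple(current))
    return chain


for step in transfer_chain((2, 2, 2), (6, 0, 0)):
    print(step)
```

Each vector in the printed chain is majorized by the one before it, so the lemma can be applied once per step.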